Selected ideal natural ligand against TNBC by inhibiting CDC20, using bioinformatics and molecular biology

Object: Find potential therapeutic targets of triple-negative breast cancer (TNBC) patients by bioinformatics. Screen ideal natural ligand that can bind with the potential target and inhibit it by using molecular biology. Methods: Bioinformatics and molecular biology were combined to analyze potential therapeutic targets. Differential expression analysis identified the differentially expressed genes (DEGs) between TNBC tissues and non-TNBC tissues. The functional enrichment analyses of DEGs shown the important gene ontology (GO) terms and pathways of TNBC. Protein-protein interaction (PPI) network construction screened 20 hub genes, while Kaplan website was used to analyze the relationship between the survival curve and expression of hub genes. Then Discovery Studio 4.5 screened ideal natural inhibitors of the potential therapeutic target by LibDock, ADME, toxicity prediction, CDOCKER and molecular dynamic simulation. Results: 1,212 and 353 DEGs were respectively found between TNBC tissues and non-TNBC tissues, including 88 up-regulated and 141 down-regulated DEGs in both databases. 20 hub genes were screened, and the higher expression of CDC20 was associated with a poor prognosis. Therefore, we chose CDC20 as the potential therapeutic target. 7,416 natural ligands were conducted to bind firmly with CDC20, and among these ligands, ZINC000004098930 was regarded as the potential ideal ligand, owing to its non-hepatotoxicity, more solubility level and less carcinogenicity than the reference drug, apcin. The ZINC000004098930-CDC20 could exist stably in natural environment. Conclusion: 20 genes were regarded as hub genes of TNBC and most of them were relevant to the survival curve of breast cancer patients, especially CDC20. ZINC000004098930 was chosen as the ideal natural ligand that can targeted and inhibited CDC20, which may give great contribution to TNBC targeted treatment.

related to TNBC, and this kind of TNBC patients was especially platinum-sensitive [8]. However, it is still unsatisfactory for patients. Therefore, further study of TNBC molecular mechanisms is still an urgent work. Also, molecular biology was a hot topic in drug development, and molecular docking and virtual screen were widely used in drug design. By using these methods, we can calculate the pharmacological properties of these ligands [9]. Meanwhile, many natural ligands can be used as lead compounds and converted into new drugs after modi cation [10]. It can be seen as the rst step of drug development. For example, Zhong et al. study suggested that ZINC000003938684 and ZINC000014811844 natural ligands were ideal potential inhibitors of PARP targeting than Olaparib [11]. And Xie et al. found lead compound for the treatment of Alzheimer's disease by virtual screening [12].
In this study, we combine the bioinformatics with molecular biology to nd new way to treat TNBC patients.
Firstly, we downloaded GSE62931 and GSE76275 databases and got the differentially expressed genes (DEGs) between TNBC tissues and normal tissues through these two databases. Then we performed functional and pathway enrichment analysis of these DEGs. Meanwhile, we built protein-protein interaction (PPI) network and analyzed the functions of these DEGs to get 20 most important hub genes. We also evaluated the association of genes expression levels with breast cancer prognosis. Finally, we chose one potential target of TNBC through these 20 hub genes and screened the ideal natural ligands that can combine with it and inhibit it by computational study. In short, this study's frame diagram was demonstrated in Fig. 1, and this study provided a candidate drug to treat TNBC patients.

Microarray data
We downloaded GSE62931 and GSE76275 databases, which contains 365 samples, from Gene Expression Omnibus (GEO) website [13][14][15]. We divided these 2 databases into 2 groups, TNBC tissues and normal tissues. GSE62931 contains 47 TNBC tissues and 53 normal tissues, while GSE76275 contains 198 TNBC tissues and 67 normal tissues. Meanwhile, we transformed the data of these 2 databases in order to get standardized data, then we compared and analyzed these 2 groups.

Identi cation and analysis of DEGs
We used Limma library in R studio and input the code to get the differentially expressed genes (DEGs). The log Fold Change was higher than or equal to 2 and the adjust P-value was less than 0.05. Then we used Morpheus website to make the Heat maps of these 2 databases [16]. Meanwhile, we used the Limma library again and input the relevant code to get the Volcano plots. These gures were performed with P-value < 0.05 was de ned. The log change > 0.5 folds of genes were regarded as up-regulated DEGs and log change < -0.5 folds of genes were regarded as down-regulated DEGs, while others were not-signi cant DEGs. We also labeled the most signi cant 20 DEGs in Volcano plots. We also get Venn plots to get the DEGs which existed in both GSE62931 and GSE76275 databases [17].

Functional and pathway enrichment analyses of DEGs
We used metascape website to make the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of DEGs [18]. Then we respectively put up-regulated and down-regulated DEGs into Database for Annotation, Visualization and Integrated Discovery (DAVID) website [19]. DAVID website is a database of biological information and an online free analysis software. It provided users extract biological information from a large list of genes or proteins. And The GO analysis included biological processes (BP), molecular functions (MF) and cellular components (CC) of DEGs.

PPI network construction and selection of module
We used the Search Tool for the Retrieval of Interacting Genes (STRING) website and cytoscape software to build protein-protein interaction (PPI) network [20]. Each node in STRING website represents a protein. In STRING website, different isoforms produced by the same gene were combined, and the letter marked on the node is actually the gene symbol of the corresponding gene. The lines between nodes represent interactions between two proteins, and different colors correspond to different types of interactions.

Prognostic analysis of the most important 20 DEGs
We used Kaplan website to analyze the survival curve [21]. This website is constructed based on gene microarray and RNA-seq data from public databases such as TCGA and GEO. And it assessed the survival impact of more than 50000 genes across 21 types of cancer, including breast cancer. The Kaplan website integrates gene expression information and clinical prognostic value for meta-analysis and the study, discovery and validation of molecular markers related to survival. We input the 20 hub genes and got the overall survival (OS) and recurrence-free survival (RFS) of these 20 hub genes in breast cancer.
Ligand database and the crystal structure of CDC20 We downloaded the natural products database and the chemical structure of apcin (Protein Data Bank identi er: ZINC000008434966) from ZINC website, which is a free virtual screening database of commercial compounds [22]. We screened ideal lead compounds form this database, which contains 17931 ligands.
Meanwhile, we downloaded the crystal structure of CDC20 (4N14) from RCSB PDB website [23]. The RCSB PDB website is powered by the Protein Data Bank archive-information that provided researchers all aspects of biomedicine and agriculture [24].

ADME and toxicity prediction
We employed the ADME and the TOPKAT to calculate absorption, distribution, metabolism, excretion (ADME) and the toxicity of these compounds, by analyzing chemical structure. These modules were very signi cant to analyze the safety of ligands, which can save a lot of manpower and material resources [25].

LibDock and CDOCKER molecular docking
The LibDock module was a simple and fast molecular docking method, which was used to screen large-scale data. While CDOCKER module was a precise docking method, which based on CHARMm exible docking program. We imported the crystal structure of CDC20 to the Discovery Studio and removed the crystal water and other heteroatoms from it, added, protonation, hydrogen energy minimization and ionization to it. And the apcin's binding region of CDC20 was chosen as the binding site. Then we ran the LibDock and got 7,416 ligands and their LibDock scores. The top 20 ligands were listed based on the LibDock score [26].

Molecular dynamic simulation
We selected the ligand-CDC20 complexes' best binding conformations among these poses by the molecule docking program. We simulated the physiologic environment by adding sodium chloride to the system, which was relaxed by energy minimization in the CHARMm force eld. We can use the molecular dynamic simulation to get the potential energy and the RMSD in natural environment.

Identi cation of DEGs
Firstly, 2 databases, GSE62931 and GSE76275 were downloaded from GEO website [27,28]. Then we got 1,212 and 353 DEGs respectively between TNBC tissues and normal tissues. Heat maps of these two databases' DEGs expression were shown in Fig

Functional and pathway enrichment analyses
We used metascape website to make the functional and pathway enrichment analysis (Fig. 3B, 3C and 3D and Supplementary Table 1). The enrichments of DEGs were mostly in 'epithelial cell differentiation', 'developmental growth', 'regulation of hormone levels', 'urogenital system development' and 'regulation of growth'. Figure 3C and 3D were respectively colored by cluster ID and P-value.
Meanwhile, we respectively put the up-regulated and down-regulated DEGs into DAVID website, and made the functional enrichment analysis again (Supplementary Fig. 1B  PPI network construction and the selection of module STRING and cytoscape software were used to build PPI network (Fig. 4A). Based on the degrees, we get the top 20 genes as the hub genes, including MELK, CDC20, EZH2, TYMS, MCM10, BUB1B, FOXM1, ASPM, TTK, TPX2, NDC80, PRC1, CEP55, NUF2, ANLN, FOXA1, AR, SKA1, FAM64A and DEPDC1B (Table 1). Then we selected the most important 3 modules (Supplementary Table 2). The module 1 contains 18 genes, including CDC20. The module 2 contains 5 genes and the module 3 contains 8 genes. We also made the functional enrichment analysis of these 3 modules and got the Supplementary Fig. 2. The genes of module 1 were mainly enriched in 'sister chromatid cohesion', 'cell division' and 'mitotic nuclear division', while genes of module 2 were mostly enriched in 'intermediate lament', 'keratin lament' and 'structural molecule activity'. And as for module 3, genes were mostly enriched in 'transcription factor binding', 'transcriptional activator activity' and 'transcription regulatory region DNA binding'.

Survival analysis of hub genes
We put these 20 hub genes into Kaplan website and performed the survival curve analysis. The results illustrated that the overall survival (OS) of TNBC patients with higher expression 14 hub genes were shorter than those with lower expression 14 hub genes (P < 0.01), especially CDC20 (Fig. 4B). Likewise, the recurrence-free survival (RFS) of TNBC patients with lower expression 18 hub genes were longer than TNBC patients with higher expression 18 hub genes (P < 0.01), especially CDC20 (Fig. 4C). On the contrary, the RFS and OS of TNBC patients with higher expression AR genes were longer than those with lower expression AR genes (Supplementary material). In short, most of hub genes were associated with poor prognosis of TNBC patients. Cell-division cycle protein 20 homologue (CDC20) as one of the most signi cant hub gene and an important gene in module 1, was shown to be a great therapeutic target of TNBC patients [29]. Therefore, we chose CDC20 as targeted site for further study.

ADME and toxicity properties of CDC20 inhibited ligands
We downloaded a natural database from ZINC database, which contains 17,931 ligands, to screen potential CDC20 targeted inhibitor. Meanwhile, we chose apcin, a CDC20 targeted inhibitor, as the reference drug [30].
We also downloaded the crystal structure of CDC20 protein and chemical structure of reference drug, apcin ( Fig. 5A and Supplementary Fig. 3A). 7,416 ligands were indicated to bind rmly with CDC20 protein through the LibDock screening. We listed the top 20 ligands in Supplementary Table 3 based on the LibDock score.
Then we analyzed the Pharmacologic properties of these top 20 ligands by ADME and Toxicity Prediction (Tables 2 and 3). Among these ligands, owing to the non-hepatotoxicity, more solubility level and less carcinogenicity than apcin, ZINC000004098930 was selected as the lead compounds for further study.

Ligand-binding site analysis and Ligand Pharmacophore
We use the CDOCKER module of Discovery Studio to assess the ligand binding mechanisms of ZINC000004098930 and apcin with CDC20 ( Fig. 5D and 5F). And the CDOCKER potential energy of ZINC000004098930 (-29.9471 kcal/mol) was lower than apcin (-25.8556 kcal/mol), which means ZINC000004098930 could bind more rmly than apcin ( Table 4). The results also shown the charge, the hydrogen bonds and the π-related interactions between these ligands and CDC20 (Fig. 5E, 5G and Supplementary Table 4). It is obvious that ZINC000004098930 had 2 hydrogen bonds and 4 π-related interactions with CDC20, while apcin had only 3 π-related interactions with CDC20.  Fig. 3C and 3D).

Molecular dynamic simulation
Stability was very signi cant in drug development, so we use molecular dynamic simulation module to analyze the stability of ZINC000004098930-CDC20 and apcin-CDC20 complexes. We got the potential energy and RMSD curves of these complexes by molecular docking experiment ( Fig. 5H and 5I). After 18 ps, the ZINC000004098930-CDC20 and apcin-CDC20 complexes' trajectories reached equilibrium, and gradually being stabilized with time going by. In conclusion, these two complexes could be stable in natural circumstances.

Discussion
Triple-negative breast cancer (TNBC) as a poor prognosis disease, attracted more and more people's concern and attention. Owing to lack HR, ER and HER2, TNBC still do not have available target therapy options [31]. In clinical practice, chemotherapy still the main treatment to TNBC patients [32]. Recently, targeted therapy has been a hot topic. And new targeted site and new targeted drugs had been proved very bene cial for tumor patients [33]. Higher response rates were seen when targeted inhibitors are combined with chemotherapy [34].
Therefore, it is very bene cial for us to nd new targeted site and new potential ideal targeted drugs.
In this study, we combined bioinformatics with molecular biology to provide new ideas for TNBC treatments.
Firstly, we analyzed GSE62931 and GSE76275 database, which contains 245 TNBC tissues and 120 normal tissues. Trough the Heat maps and Volcano plots, we got 1,212 and 353 differentially expressed genes (DEGs) form these 2 databases. And the Venn plot showed that there are 299 DEGs in both GSE62931 and GSE76275 database, including 88 up-regulated and 141 down-regulated genes. These DEGs could be regarded as potential biomarkers and targeted site for TNBC.
Then, we used metascape website to make the functional enrichment analysis, to study molecular pathways in TNBC. The gures and tables indicated that DEGs were mostly enriched in 'developmental growth', 'regulation of hormone level' and 'epithelial cell differentiation'. Also, we respectively analyzed the upregulated and down-regulated DEGs by functional enrichment in DAVID, and results illustrated that the enrichment of up-regulated DEGs was mainly in BP, MF, CC terms, such as 'mitotic nuclear division', 'identical protein binding' and 'cytoplasm'. As for down-regulated DEGs, they were also enriched in BP, MF and CC terms, including 'negative regulation of cell proliferation', 'heme binding' and 'extracellular exosome'. In addition, in KEGG pathways, the enrichments of up-regulated and down-regulated DEGs were respectively mostly in 'cell cycle' and 'PPAR signaling pathway'. For example, M. Rath et al. demonstrated that mitotic nuclear division was associated with tumorigenesis, and mitotic kinesins were being validated as drug targets [35,36]. And extracellular exosome is a vesicle released into extracellular region. Some studies had shown that tumor cells can produce more exosomes than normal cell [37]. Therefore, in short, these pathways were all contribute to the progression of TNBC.
PPI network analysis was very important in bioinformatics research. By the STRING and cytoscape software, we got the top 20 DEGs as the hub genes based on the degrees. We also got the most signi cant 3 modules, which could be regarded as the most important gene clusters of TNBC. Among these, module 1 included CDC20 and other 17 genes. We also got the functional pathway enrichment analysis of these 3 modules.
DEGs in module 1 were mostly enriched in the BP, CC and KEGG pathway, including 'spindle', 'sister chromatid cohesion', 'cell division', 'cell cycle' and 'mitotic nuclear division'. CDC20 plays a great role in these 5 GO terms and pathway. Meanwhile, most of the DEGs in these 3 modules were enriched in the MF and BP of GO terms, including 'transcription regulatory region DNA binding' and 'structural molecule activity' and 'mitotic nuclear division'. On the one hand, it is obvious that abnormal transcription and mitosis could lead to tumorigenesis [38,39]. On the other hand, there were some studies demonstrated that inhibition of the cellular machinery required for the assembly and maintenance can inhibit the tumor growth [40,41]. Therefore, these terms and pathways maybe new therapeutic targets for TNBC.
In addition, in Kaplan website, we found 14 hub genes were relevant to OS of TNBC patients, and 19 hub genes were relevant to RFS of TNBC patients. Higher expression of most of them contributed to shorter lifetime, including CDC20, ANLN, ASE1, ASPM, CEP55 and so on. Among these hub genes, cell-division cycle protein 20 homologue (CDC20), as the second important gene based on the degrees in cytoscape software and the signi cant gene in module 1, took part in cell division [42]. CDC20, which was key to chromosome segregation and mitosis exit, plays an important role in cell cycle progressing [43]. It can activate a ligase, the anaphase-promoting complex/cyclosome (APC/C), which starts the anaphase and mitotic exit [44]. Cheng et al. shown that overexpression of CDC20 promotes the metastasizing of breast cancer [45]. Meanwhile, some study con rmed that overexpression of CDC20 lead to short-term breast cancer survival again [46]. It was also proved that CDC20 was a great target for anti-tumor drug development [29]. Therefore, CDC20 was a potential treatment target, and we chose CDC20 as a targeted spot for further study.
It is obvious that nding effective CDC20 inhibitor was very important for breast cancer targeted therapy.
There were many CDC20 targeted drugs, including tosyl-L-arginine methyl ester (TAME) and apcin, but they still have many issues to be addressed [29]. TAME was proved that can inhibit the binding of free CDC20 and APC and promote the CDC20 removal from the APC [47,48]. And apcin can bind to CDC20 and simultaneously disrupt the APC/C-Cdc20-substrate ternary complex by competitively inhibition to blockade the mitotic exit [49]. It was also proved that apcin can inhibit the growth and invasion of osteosarcoma cell by targeting CDC20 [50]. However, whether TAME and apcin were useful in clinical needs further investments. Among these CDC20 inhibitor, we chose apcin as the reference drug to screen new potential ideal compounds for TNBC patients.
We got 7,416 natural ligands by LibDock, and based on the LibDock score, we chose top 20 ligands to do the further study. Safety is one of the most important things in drug development. Therefore, after analyzing their biochemico-pharmacological properties by ADME and Toxicity Prediction module, we chose ZINC000004098930, which was non-hepatotoxicity, more solubility level and less carcinogenicity than apcin, as the safe lead compounds among the top 20 ligands.
Then we analyzed the pharmacophore and the ligand binding mechanisms of ZINC000004098930 and apcin with CDC20. The results demonstrated that the CDOCKER potential energy of ZINC000004098930 was lower than apcin, which means ZINC000004098930 could bind more rmly than apcin. Meanwhile, ZINC000004098930 had more hydrogen bonds and the π-related interactions with CDC20 than apcin. In general, ZINC000004098930 had a higher binding force with CDC20 than apcin.
In the end, we run RMSD and calculated the potential energy of ZINC000004098930-CDC20 and apcin-CDC20 complexes to study the stability of them by molecular dynamics simulation. As the results suggested, the trajectories of both ZINC000004098930-CDC20 and apcin-CDC20 complexes reached their equilibrium after 18 ps. They become gradually stabilized, which indicated these two complexes could exist stability in natural.
In conclusion, ZINC000004098930 could be regarded as ideal lead compounds for drug development for TNBC patients and may give new thoughts to TNBC targeted therapy.
Recently, targeted therapy is a hot topic for tumor treatment, but we still do not have perfect drugs for TNBC treatment. In this study, we combined bioinformatics with molecular biology to screen a new ideal ligand, which targeted inhibit CDC20. Although there is a long way from clinical application, it provided a new way to treat TNBC. ZINC000004098930 as a natural ligand has unique advantages. To sum up, we did the rst step of drug development for TNBC patients. And what's more, we provided 18 else hub genes and many targeted pathways, which may be useful in future study.

Conclusions
We found 229 DEGs between TNBC tissues and normal tissues, including 88 up-regulated and 141 downregulated DEGs. 20 hub genes were screened and most of them were relevant to the survival time of breast cancer patients. Therefore, we chose CDC20, which plays a great role in TNBC, as the potential target. We screened 7,416 natural ligands that can bind rmly with CDC20 from ZINC database. And among these ligands, ZINC000004098930 was regarded as the potential ideal ligand, owing to its non-hepatotoxicity, more solubility level and less carcinogenicity than the reference drug, apcin. Meanwhile, ZINC000004098930-CDC20 was proved could exist stably in natural environment. In short, ZINC000004098930 may be the ideal targeted ligand after modi cation, which may give great contribution to TNBC targeted treatment. Availability of data and materials

List Of Abbreviations
The breast cancer microarrays GSE62931 and GSE76275 datasets can be accessed from the Gene Expression Omnibus (GEO) website (https://www.ncbi.nlm.nih.gov/geo/).

Competing interests
The authors declare that they have no competing interests.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.